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Abstract: Image reconstruction in fluorescence optical tomography is 
a three-dimensional nonlinear ill-posed problem governed by a system 
of partial differential equations. In this paper we demonstrate that a 
combination of state of the art numerical algorithms and a careful hardware 
optimized implementation allows to solve this large-scale inverse problem 
in a few seconds on standard desktop PCs with modern graphics hardware. 
In particular, we present methods to solve not only the forward but also the 
non-linear inverse problem by massively parallel programming on graphics 
processors. A comparison of optimized CPU and GPU implementations 
shows that the reconstruction can be accelerated by factors of about 15 
through the use of the graphics hardware without compromising the 
accuracy in the reconstructed images. 
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1. Introduction 

In this paper, we study an indirect measurement technology, namely fluorescence optical to- 
mography, which relies on the ability of fluorescent dyes (fluorophores) to absorb and re -emit 
light at different wavelengths. The aim of fluorescence tomography is to obtain the distribution 
of a fluorophore inside an object (an only indirectly accessible quantity) from optical measure- 
ments at the boundary (an easily obtainable information). The strength of this imaging modality 
is the availability of a large number of fluorophores which change their optical properties in de- 
pendence of its biological surrounding. Thus, they reflect physiologically interesting parameters 
like the pH value [1,2] or tissue oxygenation [3]. 

Unfortunately, photons are massively scattered in biological tissues. Therefore, fluorescence 
tomography requires the accurate simulation of light propagation in highly scattering media 
and the solution of ill-posed nonlinear inverse problems, which are computationally intensive 
tasks. Fast and reliable numerical algorithms will crucially determine the success of this new 
imaging modality in biomedical sciences as well as in engineering applications. 

A recent trend is the application of graphics processing units (GPUs) in scientific compu- 
tations. In the field of optical tomographic imaging, Zhang et al [4] have reported GPU ac- 
celerated bioluminescence reconstructions and make use the proprietary package CULA by 
EM Photonics for the inversion of the finite element system matrix. Prakash et al [5] also use 
CULA from within the Matlab-framework NIRFAST to speed up the reconstruction algorithm 
for diffuse optical tomography. In their work, six distinct tasks are GPU-accelerated such as the 
solution of a linear equation system and matrix-matrix multiplications. 

The previously mentioned works use the GPU as co-processor mainly, i.e. data is trans- 
ferred to the graphics card where certain operations are performed in parallel. The result 
is copied back to the CPU then. This submission is different in the sense that we aim to 
perform nearly all computations on the graphics hardware without the need to transfer in- 
termediate data between the CPU and the graphics card. Thus, the GPU is not only used 
for trivial multiplication tasks but also for the assembly of the forward and adjoint sys- 
tem matrices and the sensitivity matrix, for example. This reduces the comparatively slow 
data transfer between the computer and the graphics hardware to a minimum and results in 
much faster image reconstruction. Furthermore, all algorithms are iterative which circumvents 
the need for matrix inversion algorithms. Our implementation is based on the open-source 
library AGILE [6] which is available at http://www.imt.tugraz.at/research/ 
agile-gpu- image- reconstruction- library/. 

The outline of this paper is as follows: Section 2 provides background material about fluores- 
cence tomography and describes the iterative regularization method that is used for the stable 
solution of the inverse problem. Section 3 deals with the discretization of the problem by finite 
element methods and gives details for an efficient implementation on graphics hardware. In 
Section 4, we define a test problem and derive estimates for the complexity and required mem- 
ory of the reconstruction algorithm. The theoretical results are confirmed by numerical tests, in 
which we compare the performance of our algorithms on CPU and GPU hardware. Our tests 
illustrate that the reconstruction times can be reduced significantly (by a factor of about 15) 
through the use of the graphics hardware. 
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2. Methods 



2.1. Mathematical model 

The propagation of light in highly scattering media, e.g. in biological tissue, is typically mod- 
eled by the diffusion approximation [7], which serves as a first order approximation to the 
radiative transport equation. Since light of two different wavelengths is employed in fluores- 
cence tomography, the mathematical model consists of a coupled system of partial differential 
equations (PDE) [8], namely 

-db/(K x V(p x )+n x <p x = qi inQ, (1) 

-div(jc„,V<p m ) + n, n <p m = yc<p Xl in Q. (2) 

Here, <p,, i = x,m are the photon densities for the excitation (x) and the emission (m) light 
in the domain £1 covered by tissue, and the source term q models a collimated light source. 
The parameters jc, and /i; are the diffusion and absorption coefficients of the tissue, and y 
is a measure incorporating the fluorophore's extinction coefficient and its quantum yield. We 
assume that these parameters depend in a known form on the concentration c of fluorophore in 
the tissue, and we write JQ(c), jtl, (c) or y(c) if this dependence needs to be emphasized. 
Homogeneous Robin boundary conditions [7] 

pcpi + Ki^- = 0, ondQ, i=x,m. (3) 

an 

are used to model that no light enters the domain from the outside apart from the light injected 
through the internal source q. 

The simplest type of boundary measurement is the photon flux at the emission wavelength 
which leaves the domain which is modeled by 



m((p m ) = / d(s)(p m (s)ds 



(4) 



where the function d allows to model the aperture and additional (linear) transfer characteristics 
of the detector. Other types of measurement functions are possible as well and have been used 
in e.g. [9-11]. 

The tomographic measurement process finally consists of illuminating the object with a se- 
quence of light sources qj, j = 1 , . . . , ns, and taking measurements of the emission light with an 
array of detectors d,, i = 1 , . . . , nd at the boundary. This gives rise to a measurement matrix ^# 
of dimension nd x ns, which obviously depends on the distribution c of the fluorophore, and we 
write ^#(c) to express this dependence. 

2.2. Image reconstruction 

The aim of fluorescence tomography is to determine the inaccessible fluorophore distribution 
c which gave rise to the measurements ^ 5 — which are perturbed by measurement noise and 
model errors — of the emitted light. A Tikhonov-type regularization term is incorporated in the 
objective functional for stability reasons. Then the reconstruction problem reads 



argmin t . 



> 2 +|||c- C0 || 2 



To solve this problem in a fast and stable manner, we utilize the iteratively regularized GauB- 
Newton method (IRGN) [12], which leads to the iterative algorithm 

(S* k S k + a k I)(c k+l -c k ) = S* k (^ s -^(c k )) + a k (c 0 -c k ) k = Q,\,.... (6) 
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Here, S/; := j£ denotes the sensitivity operator (i.e. the Jacobian), which measures the ef- 
fect of a change in the fiuorophore distribution c onto the output ./#(c) and S* k is its adjoint. The 
regularization parameter Cfy stabilizes the solution and is usually decreased in every iteration. 

For the sake of completeness, we would like to mention other possibilities for the regular- 
izing of the Newton iteration, e.g. the truncated Newton-CG algorithm [13], the Levenberg- 
Marquardt method [14], or the Newton-Landweber method [15]. Several of these approaches 
have been demonstrated to work well for optical tomography applications [8, 16-19]. Non- 
quadratic penalty terms have been studied e.g. in [20,21]. 

3. Implementation 

In the following, we discuss the the main steps of the reconstruction algorithm in more detail 
and provide information for an efficient implementation. In order to quantify the expected com- 
putational workload, we also derive complexity estimates for the individual steps and provide 
upper bounds for the memory requirements. 

3.1. Discretization 

For the discretization of the governing partial differential equations, we consider a standard 
finite element method with piecewise linear basis functions for the photon fields. In our com- 
putations, we use an unstructured mesh with Ny vertices consisting of Nj tetrahedral elements. 
For computational efficiency but also simplicity, the optical parameters and the fiuorophore 
concentration c are discretized by piecewise constant functions over the same mesh. The dis- 
cretized version of the system (l)-(2) reads 

[K x (c)+M x (c)+R X ]V X = Q (7) 
[K m (c) +M m (c) +R m ]V m = G(c)V x . (8) 

In this linear system, K,, Mj and ; S {x,m}, denote the stiffness-, mass- and boundary mass- 
matrices at the respective wavelength. Each column of the Ny x ns matrix Q corresponds to 
a different excitation qj, j = 1 , . . . , ns, and the ns columns of the solution matrices V x and V m 
store the vector representations of the corresponding photon density fields <p A and <p m . The 
measurement data are given by ^(c) = D J V m , where each of the nd columns of the detector 
matrix D corresponds to one of the detectors di, i = 1, . . . , nd, in (4). 

The discrete sensitivity S = S(c) is an nd x ns x Nt tensor (respectively an (nd ■ ns) x Nt 
matrix). Its action onto a parameter perturbation h € M. Nt is given by 

Sh := -W^K m (h)V m - wZM m (h)V m (9) 
- WjK x (h)V x - wjM x (h)V x + WjG(h)V x , 

where h is an arbitrary element from the Nj dimensional parameter space, and W m , W x denote 
the solutions of the adjoint system 

[K m (c) +M m (c)+R m ]W m = D, (10) 
[K x (c) +M x {c)+R x ]W x = G(c)W m . (11) 

The fully discrete equivalent of the iteratively regularized GauB-Newton method (6) has the 
form listed in Algorithm 1 . 

Before going into details of the implementation and giving complexity estimates for the 
individual step as well as upper bounds for the memory requirements, let us make some general 
remarks: 
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Algorithm 1 Iteratively regularized GauB-Newton algorithm 
1: initialize 

2: C^O 

3: for k — 1 to maxit do 

4: [A x ,A m ,G] <— assembleSystemMatrices(c J t) 
5: V X <-PBCG(A X ,Q) 
6: V m <- PBCG(A m ,GV x ) 

7: M <— D T V m 

8: W„, PBCG(A,„,D) 

9: PBCG(A. r ,GW m ) 

10: Sk 4— asseinbleSensitivity(c J t,V Ar ,V r m ,VK x , W m ) 

11: Jc^- (SjS k + a k iy l Sj{M 5 -M) 

12: C-^C + c/c 
13: ajt+i <— q(Xk 
14: end for 



(i) The presented algorithms are independent of the architecture used for the actual compu- 
tations, i.e. apart from some implementation details, the same algorithms are used for CPU and 
GPU implementations. In our numerical tests, the high-level algorithms are always controlled 
by the CPU, while the actual computations (e.g. matrix-vector multiplications) are executed on 
the CPU or GPU, respectively, inside a few kernels. Only these few kernels have to be adapted 
to the specific hardware. The advantage of such a "minimally invasive" framework is that al- 
gorithms can be migrated gradually from one hardware to another while at the same time the 
high-level abstraction of the algorithms can be preserved. Such an approach seems very natu- 
ral, and has been employed previously, e.g. for the simulation of fluid dynamics on CPU-GPU 
clusters [22]. 

(ii) All compute intensive steps of the reconstruction algorithm have a high degree of local- 
ity: A major part of the operations, e.g. the assembly of the system matrices or the sensitivity, 
can be performed in an element-by-element fashion. This means that similar computations can 
executed for individual elements independently from each other. Global operations, e.g. the so- 
lution of the linear systems, rely on sparse matrix algebra. Note that each row (or column) of the 
sparse matrices corresponds to a vertex of the mesh, and the non-zero entries of this row stem 
from degrees of freedom (vertices) belonging to the patch of elements surrounding this vertex. 
Therefore, also sparse-matrix operations share a high degree of locality, and consequently, the 
overall reconstruction algorithm is very well-suited for parallelization. 

3.2. Initialization 

The first task is to set up a multilevel mesh hierarchy: The vertex coordinates of the finest level 
are loaded, and the transfer operators between adjacent levels / and / + 1 are stored as sparse 
prolongation matrices Pj +l . This allows to lift a function from level / to the next finer level 1 + 1, 
i.e. W;_|_i = Pj +1 ui denotes the linear prolongation of a piecewise linear finite element function 
U\ to the finer mesh where it is represented by m/+i. The mesh hierarchy is only required for 
the initialization step. The memory requirement for storing the prolongation matrices is 
0{N V ). 

The element vertex numbers are stored in an Nj x 4 table containing the global vertex num- 
bers for the Nj individual elements. The 4-element array g T then contains the numbers of the 
vertices spanning the tetrahedron T. During the setup step, also the vertex-to-element connec- 
tivity is required, which is simply given by Ly = (L%) J . The memory requirement for these 
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Fig. 1. (a) Sparse matrix which is to be stored in compressed row-storage (CRS) format, 
(b) The conventional memory layout on the CPU consists of the linearized data elements 
together with their column indices and a vector holding the offset of each row in the data 
vector. The number of elements in a row is given implicitly by the difference in the row 
offsets, (c) The adapted layout for the GPU stores the rows in an interleaved manner which 
requires memory padding marked with x and an additional vector holding the number of 
non-zero entries per row. 



look-up tables is 0(Ny +Nt). 

One additional look-up table is needed for the efficient assembly of the sparse system ma- 
trix. This matrix is stored in compressed row-storage (CRS) format. Its sparsity pattern can be 
created from the mesh hierarchy and the mapping (local index — » global index — » CRS data 
index) can be readily computed. We store the inverse of this mapping, which is a one-to-many 
table and requires 0(Nt) memory. 

Finally, we pre-compute the mass-, boundary-mass- and stiffness-matrices, M T , R T and K T , 
for unit coefficients: 



and R ' j := Lndn VVW VC/)*** 1 ^ ^ 4 - 

Each expression results in Nt matrices of size 4x4, which are copied to the GPU memory. 
3.3. Sparse matrix format 

The solution of linear equation systems heavily relies on fast sparse matrix-vector products. For 
comparison of matrix-vector kernel performance of different sparse storage layouts we refer 
to [23]. The results presented in this paper are based on a blocked interleaved compressed row- 
storage (CRS) format sketched in Fig. 1 . In contrast to the conventional CRS layout, the adapted 
storage scheme stores the elements of the rows interleaved which makes access from different 
threads more efficient. As not all of the rows have the same number of non-zero entries, the 
shorter rows have to be padded with dummy elements. In order not to waste too much memory, 
this padding is done in blocks of 32 rows (a block-size of two is used in Fig. 1 to illustrate the 
technique). Further details about this format can be found in [24]. 
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3.4. Assembly of the system matrices 

When assembling of the system matrix, the individual element contributions have to be 
summed. In a straightforward implementation this would result in non-local random write- 
access on the output matrix which is costly on graphics hardware. However, due to the piece- 
wise constant discretization of the coefficients K, jj., p and c, the contribution of an element T 
to the excitation and emission system is given as 



Therefore, we split the assembly into two steps: First, the pre-computed element matrices K T , 
M T and R T are scaled with the element's optical parameters and the output is stored in a vector 
H with length (A-A-Nt). In the second step, we sum for every position in the CRS matrix 
the corresponding elements from H. In principle, this summation can be done by multiplying a 
suitable sparse matrix with H. However, as all non-zero elements in this matrix are 1, we imple- 
mented a specialized GPU kernel for this task, which simply sums the elements whose indices 
are stored in a look-up table. By using such a two-step scheme, we avoid non-local write-access 
and need non-local read-access only for the second summation step, which is implemented us- 
ing the GPU's texture cache. Figure 2 illustrates this procedure. As the construction of the local 
element matrices does only require local memory access, it can be done fully in parallel which 
provides almost optimal performance on both, CPU and GPU processors. The total complexity 
of this step is 0(Nj). 

3.5. Solution of the linear systems 

For the solution of the linear systems A X V X = Q and A m V m = G(c)V x for the forward problem, 
respectively, A m W m = D and A X W X = G(c)W,„ for the adjoint problem, we employ a precondi- 
tioned blocked conjugate gradient method (PBCG): instead of solving for one right hand side 
(i.e. a column in Q) after another, we solve simultaneously for all right hand sides (blocked). 
The algorithm uses the PCG framework provided by the AGILE library. Thus, only a new kernel 
had to be coded for the blocked matrix-vector-product. 

To accelerate the convergence, a single V-cycle of a geometric multigrid preconditioner [25] 
is used. On the coarsest mesh level, an exact solve is performed by a blocked conjugate gra- 
dient method (BCG), which is similar to PBCG but without further preconditioning. A single 
BCG iteration is also used in the smoothing step. The application of one multigride V-cycle 
has linear complexity. If the right-hand side has ns columns, one application of the multigrid 
preconditioner requires 0(ns-Ny) algebraic operations. 

Note that all basic operations in the conjugate gradient method and the multigrid precon- 
ditioner (e.g. the matrix-vector products or simpler vector operations) can easily be executed 
on CPUs or GPUs, respectively. Due to the preconditioning, the typical iteration numbers re- 
quired for convergence of the PBCG algorithm are 10-15. Therefore, the total complexity of 
the solution of the forward and adjoint problems is 0(ns ■ Ny) and 0(nd -Ny), respectively. 

3.6. Assembly of the sensitivity 

For the assembly of the sensitivity matrix, we use the representation (9). As h is a piecewise 
constant function, the mass- M(h) and stiffness-matrices K(h) are just the element matrices 
M T and K T for the T-th finite element scaled by the derivatives of the optical parameters. Since 
the computations for the individual elements are independent from each other, this loop can be 
executed in parallel. 
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A T X = K x (c T k )K T +^(c T k )M T + Px R T , 
A T m = K m (cl)K T + Hm(cl)M T +p m R T . 



(12) 
(13) 
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Fig. 2. Assembly of the system matrix: The 4x4 stiffness-, mass- and boundary mass- 
matrices K, M, and R of every element are scaled with the optical parameters K", ix and 
p, summed and stored temporarily into a vector T. In a second step, these elements are 
compressed into a sparse matrix A by summing the elements specified by the vertex-to- 
element connectivity which is given as look-up table (LUT). 



The pseudo-code for the GPU kernel used to assemble the sensitivity matrix is listed in 
Algorithm 2. The threads are started in blocks of dimension (x,y,z) = (4 x 4 x 32). The size 
of (4 x 4) is due to the size of the element-wise mass- and stiffness-matrices on a tetrahedral 
mesh. The last dimension is the number of elements which are processed in parallel and has 
been set to 32 to maximize the number of threads per block. 

For readability, all data structures which exist once for every element being processed carry 
a superscript z instead of writing them as an array. Thus, « z is actually an array with 32 entries, 
c/ z [x] is a 2D array with size (32 x 4) and k z (x,y) is a 3D array with size (32 x 4 x 4), for 
example. 

The number of the element which is processed by a thread can be inferred from the index of 
the thread block as shown in line 1. In line 2, g T denotes the element-to-vertex connectivity of 
the T-th finite element, which is cached into the vector d. The algorithm caches the element's 
mass- and stiffness-matrices (lines 4-6) and the forward and ajoint solutions (lines 9-10 and 
12-13). Line 15 is an abbreviation for the computation of the sensitivity contribution according 
to Eq. (9). These contributions are summed in parallel (line 16) and the result is stored in S. 

Assembling the sensitivity matrix requires 0(ns ■ nd-Nj) operations and the same amount of 
memory. Hence, this is the most time consuming and memory intensive step in the current algo- 
rithm. We would also like to note that the assembly of the sensitivity matrices St can be avoided 
completely if it is applied on-the-fly when computing the matrix-vector products; cf [19] and 
the remarks in section 4.6. 
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Algorithm 2 GPU kernel for assembling the sensitivity matrix 



7i z <- 32-blockIctx + z 

synchronize 

£ z (x,y)^" 2 (x,y) 
m z (x,y) ^M" z (x,y) 
r z (x,y)^fl" 2 (x,y) 

for £ = 0 to ns — 1 do 

v z [x]^V r (</ z [x],fc) 
v»[x]<-V m (d» [«],*) 
for / = 0 to nd — 1 do 

w*[y]^W[yy) 

<[y]^W m (^[y],/) 

synchronize 

H z (x, y) = f(k z (x, y ) , m z (x, y) , r z (x, y) , v z [x] , v* [x] 

f z ^X xy tf z (x,y) 
S(/?,n z ) w z 

end for 
end for 



% compute element index 
% cache connectivity 

% cache element matrices 



% cache forward fields 



cache adjoint fields 



% parallel reduce 

% store element's sensitivity 



3.7. Solution of the Gaufl-Newton system 

Formally, the GauB-Newton matrix S* k Sk + a^I is a dense Nt X Nj matrix, which would be 
expensive to compute and almost impossible to store. Since by construction S* k Sk + d^l is sym- 
metric and positive definite, the GauB-Newton system can be solved efficiently by the method 
of conjugate gradients, which requires only the multiplication with and St. In this manner, 
the complexity of multiplication with the GauB-Newton matrix is given by 0(ns ■ nd ■ Nt) alge- 
braic operations, whereas a multiplication with the pre-multiplied matrix StS* + a^I would be 
0(Nt )■ 

For the iterative solution of the GauB-Newton systems we utilize hardware optimized dense 
matrix-vector kernels, i.e. LAPACK-BLAS [26] for the CPU and a custom implementation based 
on AGILE for the GPU version. 

4. Results 

4.1. Model problem 

For our numerical tests, we considered a mouse phantom based on the Digimouse model [27]. 
The measurement setup consisted of ns = 24 sources and nd = 24 detectors, which were ar- 
ranged on three rings around the homogeneous torso. Two spherical fluorophore inclusions 
with a diameter of 5 mm were placed at the height of the middle optode ring inside the body, 
and the aim of our test was to reconstruct these inclusions. The setup is depicted in Fig. 3. 

For our simulations we used realistic optical tissue properties which were gathered from lit- 
erature [28-31], cf Table 1. These parameters were kept constant during simulation and recon- 
struction. In practice, the estimation of those background parameters would require additional 
a-priori knowledge or measurements. 

The data used in our numerical tests were generated on a finer mesh with 200835 elements 
and additionally perturbed by 1 % Gaussian noise relative to the largest measurement datum. 
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Table 1. Optical Parameters Used for the Simulations 





Ms 


AW 


e 




p 




mm 


mm" 1 


mm" 1 


M 1 




excitation 


0.275 


0.036 


8.35- 


10 3 


0.2 


emission 


0.235 


0.029 


2.81 • 


10 3 


0.2 



For the finite element discretization, we used a hierarchy of three nested tetrahedral meshes, 
which were generated with the open source mesh generator NETGEN [32]. The three meshes 
consisted of 2720, 21760, and 174080 elements and 853, 5103, and 34677 vertices, respec- 
tively. 

4.2. Choice of parameters 

The regularization parameters Cfy and the stopping index maxit in the GauB-Newton algorithm 
were chosen as follows: We conducted a series of numerical tests for ten similar phantoms, and 
determined a set of parameters ttk that worked reasonably well for all samples. These param- 
eters were then used for the actual reconstruction. For the results presented here, a geometric 
sequence Cfy. = qOiQ with ocq = 1 and q = 0.2 was used. The GauB-Newton iteration was stopped 
when the regularization parameter reached the pre-defined threshold a„„„ = 1 x 10" 5 , which 
corresponds to maxit = 8 Newton iterations. 

4.3. Numerical results 

All computations were executed on an AMD Phenom 9950 processor. Both, the CPU code 
and the GPU host code (i.e. not the code running on the graphics hardware ifself) are single- 
threaded. The compute intensive operations (e.g. the matrix-vector multiplications and the as- 
sembly of the sensitivity matrix) were realized as compute kernels, which were implemented 
for both, CPU and GPU hardware. For our GPU tests, we used an NVIDIA GTX 480 graphics 
card and NVIDIA's CUDA toolkit [33] for programming. A typical result of the image recon- 
struction is shown in Fig. 4. 

4.4. Performance analysis 

In Table 2, we provide a detailed performance analysis of the reconstruction algorithm and 
compare the computation times of the CPU and GPU implementations. For both architectures 
we also compared the reconstruction with and without multigrid preconditioning. For all rel- 
evant tasks, the execution on the GPU results in a significant speed-up. An exception is the 
assembly of the system matrices on the coarsest level, where the number of elements is so 
small that the acceleration achieved on the graphics card is spoiled by the latency for starting 
the kernels on the GPU. However, the time spent for this part of the computation is negligi- 
ble. The observed speed-up by factors of about 10 is in good agreement to the accelerations of 
sparse matrix-vector products [34]. (Note that the factor 10 is mainly due to the approximately 
10 times larger memory bandwidth provided by GPUs). 

4.5. Accuracy 

Only single-precision arithmetic executes very fast on current commodity graphics hardware, 
i.e. use of double-precision operations results in increased memory requirements and a tremen- 
dous performance drop. In order to obtain insight into the errors introduced by using single- 
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Fig. 3. Simulation setup showing the mouse phantom, the excitation sources (cylinders) 
and the photon-detectors (cones). The cross-section at the height of the central optode ring 
shows the assumed concentration distribution which consists of two 5 mm spheres with a 
fluorophore concentration of 10,11 M. 




Fig. 4. Reconstruction of the two fluorescent inclusions shown in Fig. 3 performed with 
graphics hardware acceleration. The cross-section is again taken at the height of the middle 
optode ring. 1 % noise relative to the largest measurement datum was added for to the 
reconstruction. 



#149380 - $15.00 USD Received 16 Jim 2011; revised 4 Aug 201 1; accepted 16 Aug 201 1; published 28 Oct 2011 
(C) 201 1 OSA 1 November 201 1 / Vol. 2, No. 1 1 / BIOMEDICAL OPTICS EXPRESS 3218 



Table 2. Comparison of Single-Threaded CPU and Parallelized GPU Reconstruction Times 
for the Digimouse Phantom with 174080 Elements 



Task 


CPU 


GPU 


Speed-up 


Assembly of the system matrices 












Compute element contributions 












Mesh level 1 


n.a. 1 




0.04 


ms 




Mesh level 2 


n.a. 1 




0.10 


ms 




Mesh level 3 


n.a. 1 




0.66 


ms 




Convert to CRS format 












Mesh level 1 


n.a. 1 




0.72 


ms 




Mesh level 2 


n.a. 1 




2.06 


ms 




Mesh level 3 


n.a. 1 




13.17 


ms 




Total 












Mesh level 1 


0.63 


ms 


0.76 


ms 


0.83 


Mesh level 2 


6.37 


ms 


2.16 


ms 


2.95 


Mesh level 3 


121.97 


ms 


13.83 


ms 


8.82 


Solution of the linear systems 












PBCG without multigrid 












Forward solution (V x ,V m ) 


10.12 


s 


736.64 


ms 


13.73 


Adjoint solution (W x ,W m ) 


8.84 


s 


633.92 


ms 


13.94 


PBCG with multigrid 












Forward solution (V x ,V m ) 


5.45 


s 


461.60 


ms 


11.81 


Adjoint solution (W x ,W m ) 


6.02 


s 


508.29 


ms 


11.85 


Computation of measurements 












D J V m 


496.40 


ms 


6.53 


ms 


76.01 


Assembly of the sensitivity matrix 












assembleSensitivity 


16.68 


s 


1.03 


s 


16.23 


Solution of the GauB-Newton system 












(S* k S k + a k I)- 1 


10.87 


s 


331.92 


ms 


32.75 


Total reconstruction time 












Without multigrid 


6.39 


min 


27.39 


s 


14.00 


With multigrid 


5.41 


min 


24.94 


s 


13.01 



On the CPU the system matrices are assembled in one step 



precision operations, we conducted the tests in single-precision on CPU and GPU hardware, 
and also in double-precision on the CPU. 

First, the outcome of the forward operator (i.e. the simulated measurement data) is compared 
for the true fluorescence distribution on the mesh with with 200835 elements. The errors are 
related to the largest measurement datum such that the numerical error can be compared to 
the data noise easily. Given the single- and double-precision CPU and GPU measurement data 
./#C5> -^cd and .^gs the discrepancies are 

|I^CO-.^|l =4 _ 71xl0 -5 



max \Mcd 
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and 

1^## =7.63X10- 7 . 
max (.Mcd) 

As can be seen, the numerical error is far beyond the measurement noise of 1 % of the largest 
measurement datum. For practical applications, the "noise" will be determined mainly by the 
measurement setup (e.g. inaccuracies in the optode locations), model errors and the detector 
noise, whereas the additional numerical noise is negligible. As a consequence, the same regu- 
larization parameter (which has to be chosen depending on the noise level) can be used for both 
the CPU and GPU implementation. 

To give evidence for the similarity of the resultant images, the relative errors based on the 
L 2 norm ||e||^ 2 := f&c(x) 2 dx of the reconstructed concentrations are compared. Denoting the 
reconstructed fiuorophore concentration from the single- and double-precision CPU and GPU 
implementation with ccs, cqd and cqs, these relative errors are 

\\ccd-ccs\\/\\ccd\\ = 6.94 x 1(T 3 and 
lkcD-c GS ||/||ccD||-2.39xlO- 3 . 

The resultant images differ by less than 1 % which we consider to be sufficient for practical 
applications. 

4.6. Memory requirements 

In Table 3, we list the memory requirements for the essential data structures on the finest finite 
element mesh. The additional memory required for the coarse level matrices required for the 
multigrid preconditioner is negligible. 



Table 3. Estimated and True Memory Required (in Mega Bytes; MB) for Selected Data 
Structures of the Reconstruction Algorithm with Single-Precision Floating-Point Numbers 



Data structure 


Estimate 


MB 


Local element matrices K, M, R 


3 -(4-4) xN T 


31.88 


Assembled matrices A x , A m , G 


depending on connectivity 


15.31 


Source vectors Q 


ns x Ny 


3.17 


Detector vectors D 


nd x Ny 


3.17 


Forward solutions V x , V m 


2-nsx Ny 


6.35 


Adjoint solutions W x , W m 


2-ndxNv 


6.35 


Sensitivity S 


(nd ■ ns) x Nj 


382.50 


Total 




448.73 



In our implementation, the sensitivity matrix requires most of the GPU memory. To avoid 
excessive memory consumption, the product of the sensitivity matrix with a vector can also be 
computed on-the-fly, without the need to store the whole sensitivity. Also the complexity of the 
multiplication with the sensitivity (now 0(ns ■ nd ■ Nj)) can be reduced, i.e. it has been demon- 
strated in [19] that the application of the sensitivity can be realized in 0(ns-Ny) operations. 
Such an approach, however, requires the solution of the adjoint problems for every multiplica- 
tion with the sensitivity matrix. Comparing with the computation times in Table 2, this would 
drastically slow down the computations in practice, at least for the values of ns, nd, and Nj, 
Ny used in our test study. If the number of sources or detectors becomes very large (e.g. if a 
camera is used as detector), such an alternative might become favorable. 
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5. Discussion 



In this article, we considered the image reconstruction in fluorescence tomography by itera- 
tive regularization methods of Newton-type and a discretization by finite element methods. We 
demonstrated that a combination of state of the art numerical methods, the utilization of modern 
hardware, and a hardware optimized implementation allows to solve this large-scale nonlinear 
inverse problem within a few seconds on standard desktop PCs, without compromising accu- 
racy. In particular, the extensive use of the GPU's computing power resulted in a significant 
speed-up of the reconstruction process by factors of approximately 10-15. Such a degree of 
acceleration is in good agreement to other work [34, 35]. 

The use of unstructured meshes for our finite element simulations provides high flexibility 
with respect to geometry and facilitates the mesh generation process. A further acceleration 
of the computations could be achieved, however, by utilizing structured grids, which lead to 
structured sparsity patterns of the system matrices and allow faster access of global memory, 
cf [36]. 

In contrast to previous work [4,5], we have presented algorithms which are fully iterative in 
nature and do not require matrix inversion libraries such as CULA, for example. Additionally, 
we ported the whole reconstruction algorithm to the graphics card and use the CPU only as 
controller which avoids expensive memory transfer between the host and the GPU. 

The pre-computation of the mass- and stiffness-matrices of the individual elements and the 
creation of prolongation and restriction matrices for the mesh hierarchy are the only tasks which 
are still executed on the CPU. However, these setup steps have to be computed only once and 
could in principle also be stored along with the mesh data. 

To achieve efficient code on graphics hardware, care has to be taken when accessing global 
memory. Memory access is most efficient, if a sequence of threads reads or writes to a se- 
quential memory area (coalesced memory access) [33]. Therefore, the photon field matrices 
V x , V m and W x , W m , but also the associated source and detector matrices Q and D are stored in 
row-major order. As one column of these matrices belongs to one photon field, this means that 
the elements of a given photon field are interleaved. In combination with the blocked conju- 
gate gradient method, internal performance measurements have shown that this memory layout 
provides speed-ups by a factor of roughly 2-3 on CPUs and up to 10 on GPUs. 

The lack of fast double-precision floating point operations on current graphics hardware 
limits the accuracy of the reconstructions to some extent. In our numerical examples, the relative 
errors in the forward simulation due to the single-precision arithmetic was less then 1 % for both 
the CPU and GPU reconstructions. Thus, the numerical errors can be expected to be dominated 
by measurement noise and model errors and single-precision arithmetic may well suffice for 
this kind of (severely ill-posed) inverse problems. 

The choice of an appropriate sequence of regularization parameter Cfy crucially determines 
the performance of the iterative reconstruction algorithm. Following the analysis of the itera- 
tively regularized GauB-Newton method [12, 15], Cfy should decay exponentially, e.g. Cfy = <Xoq k 
for some 0 < q < I. The choice of Ofo and q, however, depends on the problem. While in prin- 
ciple, appropriate values can be found by a careful analysis of the problem under investigation, 
a more practical way to tune these parameters is to perform a sequence of reconstructions for 
known solutions and select regularization parameter that work well for all cases in this training 
set. The same set of parameters can then be used for similar problems. 

The choice of the stopping index maxit is another crucial ingredient for the numerical 
algorithm. If ao and q have been fixed, and Cfy = OCoq k , then this amounts to choosing a minimal 
regularization parameter a, m „ := a^it. A rigorous way to choose a, ra „ (respectively maxit) 
is provided by Morozov's discrepancy principle [37, 38], i.e. the reconstruction algorithm is 
stopped, as soon as \\^(ck) — || ~ 8, where 8 is a measure for the measurement and 
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model errors. Another rigorous way would be to choose a ~ 8. Both choices however require 
the knowledge of the level 8 of the perturbations, which is usually not known in practice. A 
heuristic way to choose a,,,,,, without knowledge of 8 is provided by the L-curve method [38,39] 
or generalized cross-validation [40]. If a series of similar problems is considered, then maxit 
(or a m j„) can again be obtained by "training" the algorithm on a small data set with known 
solutions. This strategy is probably the most practical one and also the method we utilize in our 
experiments. 

In our opinion, the fast and reliable image reconstruction is a core requirement for the ac- 
ceptance of new imaging modalities like fluorescence tomography in pre-clinical and also en- 
gineering applications. GPU-accelerated algorithms can result in reconstruction times in the 
order of a few seconds, which allows the inspection of the results on-the-fly and, thus, certainly 
facilitates the adoption of this technique in a non-research environment. 
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